{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Compare under-sampling samplers\n\nThe following example attends to make a qualitative comparison between the\ndifferent under-sampling algorithms available in the imbalanced-learn package.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to create toy dataset. It uses the\n:func:`~sklearn.datasets.make_classification` from scikit-learn but fixing\nsome parameters.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\n\n\ndef create_dataset(\n    n_samples=1000,\n    weights=(0.01, 0.01, 0.98),\n    n_classes=3,\n    class_sep=0.8,\n    n_clusters=1,\n):\n    return make_classification(\n        n_samples=n_samples,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=n_classes,\n        n_clusters_per_class=n_clusters,\n        weights=list(weights),\n        class_sep=class_sep,\n        random_state=0,\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to plot the sample space after resampling\nto illustrate the specificities of an algorithm.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_resampling(X, y, sampler, ax, title=None):\n    X_res, y_res = sampler.fit_resample(X, y)\n    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n    if title is None:\n        title = f\"Resampling with {sampler.__class__.__name__}\"\n    ax.set_title(title)\n    sns.despine(ax=ax, offset=10)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to plot the decision function of a\nclassifier given some data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n\ndef plot_decision_function(X, y, clf, ax, title=None):\n    plot_step = 0.02\n    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n    xx, yy = np.meshgrid(\n        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n    )\n\n    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n    Z = Z.reshape(xx.shape)\n    ax.contourf(xx, yy, Z, alpha=0.4)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n    if title is not None:\n        ax.set_title(title)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.linear_model import LogisticRegression\n\nclf = LogisticRegression()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prototype generation: under-sampling by generating new samples\n\n:class:`~imblearn.under_sampling.ClusterCentroids` under-samples by replacing\nthe original samples by the centroids of the cluster found.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nfrom imblearn import FunctionSampler\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import ClusterCentroids\n\nX, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)\n\nsamplers = {\n    FunctionSampler(),  # identity resampler\n    ClusterCentroids(random_state=0),\n}\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prototype selection: under-sampling by selecting existing samples\n\nThe algorithm performing prototype selection can be subdivided into two\ngroups: (i) the controlled under-sampling methods and (ii) the cleaning\nunder-sampling methods.\n\nWith the controlled under-sampling methods, the number of samples to be\nselected can be specified.\n:class:`~imblearn.under_sampling.RandomUnderSampler` is the most naive way of\nperforming such selection by randomly selecting a given number of samples by\nthe targetted class.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import RandomUnderSampler\n\nX, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)\n\nsamplers = {\n    FunctionSampler(),  # identity resampler\n    RandomUnderSampler(random_state=0),\n}\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ":class:`~imblearn.under_sampling.NearMiss` algorithms implement some\nheuristic rules in order to select samples. NearMiss-1 selects samples from\nthe majority class for which the average distance of the $k$` nearest\nsamples of the minority class is the smallest. NearMiss-2 selects the samples\nfrom the majority class for which the average distance to the farthest\nsamples of the negative class is the smallest. NearMiss-3 is a 2-step\nalgorithm: first, for each minority sample, their $m$\nnearest-neighbors will be kept; then, the majority samples selected are the\non for which the average distance to the $k$ nearest neighbors is the\nlargest.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import NearMiss\n\nX, y = create_dataset(n_samples=1000, weights=(0.05, 0.15, 0.8), class_sep=1.5)\n\nsamplers = [NearMiss(version=1), NearMiss(version=2), NearMiss(version=3)]\n\nfig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X,\n        y,\n        model,\n        ax[0],\n        title=f\"Decision function for {sampler.__class__.__name__}-{sampler.version}\",\n    )\n    plot_resampling(\n        X,\n        y,\n        sampler,\n        ax[1],\n        title=f\"Resampling using {sampler.__class__.__name__}-{sampler.version}\",\n    )\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ":class:`~imblearn.under_sampling.EditedNearestNeighbours` removes samples of\nthe majority class for which their class differ from the one of their\nnearest-neighbors. This sieve can be repeated which is the principle of the\n:class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours`.\n:class:`~imblearn.under_sampling.AllKNN` is slightly different from the\n:class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours` by changing\nthe $k$ parameter of the internal nearest neighors algorithm,\nincreasing it at each iteration.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import (\n    EditedNearestNeighbours,\n    RepeatedEditedNearestNeighbours,\n    AllKNN,\n)\n\nX, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)\n\nsamplers = [\n    EditedNearestNeighbours(),\n    RepeatedEditedNearestNeighbours(),\n    AllKNN(allow_minority=True),\n]\n\nfig, axs = plt.subplots(3, 2, figsize=(15, 25))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function for \\n{sampler.__class__.__name__}\"\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\n\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ":class:`~imblearn.under_sampling.CondensedNearestNeighbour` makes use of a\n1-NN to iteratively decide if a sample should be kept in a dataset or not.\nThe issue is that :class:`~imblearn.under_sampling.CondensedNearestNeighbour`\nis sensitive to noise by preserving the noisy samples.\n:class:`~imblearn.under_sampling.OneSidedSelection` also used the 1-NN and\nuse :class:`~imblearn.under_sampling.TomekLinks` to remove the samples\nconsidered noisy. The\n:class:`~imblearn.under_sampling.NeighbourhoodCleaningRule` use a\n:class:`~imblearn.under_sampling.EditedNearestNeighbours` to remove some\nsample. Additionally, they use a 3 nearest-neighbors to remove samples which\ndo not agree with this rule.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import (\n    CondensedNearestNeighbour,\n    OneSidedSelection,\n    NeighbourhoodCleaningRule,\n)\n\nX, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)\n\nfig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))\n\nsamplers = [\n    CondensedNearestNeighbour(random_state=0),\n    OneSidedSelection(random_state=0),\n    NeighbourhoodCleaningRule(),\n]\n\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function for \\n{sampler.__class__.__name__}\"\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ":class:`~imblearn.under_sampling.InstanceHardnessThreshold` uses the\nprediction of classifier to exclude samples. All samples which are classified\nwith a low probability will be removed.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import InstanceHardnessThreshold\n\nsamplers = {\n    FunctionSampler(),  # identity resampler\n    InstanceHardnessThreshold(\n        estimator=LogisticRegression(),\n        random_state=0,\n    ),\n}\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X,\n        y,\n        model,\n        ax[0],\n        title=f\"Decision function with \\n{sampler.__class__.__name__}\",\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\n\nfig.tight_layout()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}